A Hybrid Approach for Refreshing Web Page Repositories

Authors

  • Mohammad Ghodsi
  • Oktie Hassanzadeh
  • Shahab Kamali
  • Morteza Monemizadeh
Abstract

Web pages change frequently, so crawlers have to download them often. Various policies have been proposed for refreshing local copies of web pages. In this paper, we introduce a new sampling method that outperforms other change detection methods in experiments. Change Frequency (CF) is a method that predicts the update frequency of pages and, in the long run, achieves better efficiency than the sampling method. Here, we suggest an improvement to CF. We also propose a new hybrid method that combines our new sampling method with CF. We show how the new method preserves the benefits of both methods while reducing their fluctuations. We then extend our algorithms to handle a more realistic setting in which the limited resource is the crawler's bandwidth rather than the number of pages it can download. Our experimental results on more than two million crawled pages demonstrate the efficiency of our algorithms.

Keywords: web crawling, change detection, sampling, change frequency, hybrid algorithms
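Only the abstract is available on this page, so the Python sketch below is just an illustration of the general idea it describes, not the authors' algorithm: part of each crawl cycle's budget is spent on a uniform random sample of pages, and the rest on the pages whose estimated change frequency (CF) is highest. The `Page` class, the `hybrid_refresh` function, the `sample_fraction` parameter, and the naive changes-per-visit CF estimate are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Page:
    """Bookkeeping for one page in the local repository."""
    url: str
    visits: int = 0        # how many times the page has been re-downloaded
    changes_seen: int = 0  # how many of those visits found the page changed

    def estimated_change_rate(self) -> float:
        # Naive CF estimate: fraction of visits on which the page had changed.
        # Pages never visited get the highest priority.
        return self.changes_seen / self.visits if self.visits else 1.0


def hybrid_refresh(pages, budget, sample_fraction=0.2):
    """Choose `budget` pages to re-crawl in this cycle.

    A fixed share of the budget goes to a uniform random sample (so that
    pages with underestimated change rates still get revisited), and the
    rest goes to the pages with the highest estimated change frequency.
    """
    sample_budget = min(int(budget * sample_fraction), len(pages))
    sampled = random.sample(pages, sample_budget)
    chosen = {id(p) for p in sampled}
    by_cf = sorted((p for p in pages if id(p) not in chosen),
                   key=lambda p: p.estimated_change_rate(),
                   reverse=True)
    return sampled + by_cf[:max(budget - sample_budget, 0)]
```

For the bandwidth-limited setting mentioned at the end of the abstract, the budget would presumably be expressed in bytes and the selection would also have to account for each page's size.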

Related articles

Hybrid Adaptive Educational Hypermedia Recommender Accommodating User’s Learning Style and Web Page Features

Personalized recommenders have proved useful as a solution to the information overload problem. Especially in Adaptive Hypermedia Systems, a recommender is the main module that delivers suitable learning objects to learners. Recommenders suffer from the cold-start and sparsity problems. Furthermore, obtaining a learner's preferences is cumbersome. Most studies have only focused...

A New Hybrid Method for Web Pages Ranking in Search Engines

There are many algorithms for optimizing search engine results; ranking takes place according to one or more parameters such as backward links, forward links, content, click-through rate, and so on. The quality and performance of these algorithms depend on the listed parameters. Ranking is one of the most important components of a search engine and represents the degree of the vitality...

A Novel Approach to Feature Selection Using PageRank algorithm for Web Page Classification

In this paper, a novel filter-based approach is proposed using the PageRank algorithm to select the optimal subset of features as well as to compute their weights for web page classification. To evaluate the proposed approach, multiple experiments are performed using accuracy score as the main criterion on four different datasets, namely WebKB, Reuters-R8, Reuters-R52, and 20NewsGroups. By analy...
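The snippet above gives only the outline of the approach, so the sketch below shows one plausible way to apply PageRank to feature selection: build a term co-occurrence graph over the corpus and keep the terms with the highest PageRank scores, using the scores as weights. The co-occurrence graph construction, the `networkx` dependency, and the function name `pagerank_feature_selection` are assumptions for illustration and not necessarily the paper's exact method.

```python
import itertools

import networkx as nx


def pagerank_feature_selection(docs_tokens, top_k=100):
    """Rank candidate features (terms) by PageRank over a co-occurrence graph.

    docs_tokens: list of documents, each given as a list of tokens.
    Returns the top_k (term, weight) pairs, where the weight is the
    term's PageRank score in the co-occurrence graph.
    """
    graph = nx.Graph()
    for tokens in docs_tokens:
        # Connect every pair of distinct terms that co-occur in a document;
        # repeated co-occurrence increases the edge weight.
        for a, b in itertools.combinations(sorted(set(tokens)), 2):
            weight = graph.get_edge_data(a, b, default={"weight": 0})["weight"]
            graph.add_edge(a, b, weight=weight + 1)
    scores = nx.pagerank(graph, weight="weight")
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```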

Data Extraction using Content-Based Handles

In this paper, we present an approach and a visual tool, called HWrap (Handle Based Wrapper), for creating web wrappers to extract data records from web pages. In our approach, we mainly rely on the visible page content to identify data regions on a web page. Our extraction algorithm is inspired by the way a human user scans the page content for specific data. In particular, we use text fea...

An Efficient Approach for Near-duplicate page detection in web crawling

The drastic development of the World Wide Web in recent times has given the concept of web crawling remarkable significance. The voluminous amount of web documents swarming the web has posed huge challenges to web search engines, making their results less relevant to users. The presence of duplicate and near-duplicate web documents in abundance has created additional overhea...
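The description above breaks off before the paper's method is stated, so the sketch below shows a standard near-duplicate detection technique, k-word shingling compared with Jaccard similarity, purely as an illustration of the problem; it is not claimed to be the approach proposed in that paper.

```python
def shingles(text, k=5):
    """Return the set of k-word shingles (contiguous word windows) of a page."""
    words = text.split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def near_duplicate_pairs(pages, threshold=0.9):
    """Return index pairs of pages whose shingle overlap is at least `threshold`."""
    sets = [shingles(p) for p in pages]
    return [(i, j)
            for i in range(len(sets))
            for j in range(i + 1, len(sets))
            if jaccard(sets[i], sets[j]) >= threshold]
```

Real crawlers typically replace the exact pairwise comparison with fingerprinting schemes such as SimHash or MinHash to keep the cost manageable at web scale.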

Publication date: 2005